Group 1 Midterm
1 Introduction
Consumer preferences are important in decision-making for companies. The way consumers rank bundles of goods and services according to the utility they provide helps explain why some companies hold more power than others in a competitive market. In this project, streaming platforms are of interest because they play an important role in the more digital, work-from-home (WFH) lifestyle that many families now have. The pandemic accentuated the consumption of this service, and the massive use of different platforms during this period was evident. To understand this competitive market, this project focuses on the patterns for each streaming platform, their target populations, the ratings of their shows, and how these indicators are correlated. The research found that the world’s largest entertainment giants have ventured into streaming entertainment, including Netflix, Hulu, Prime Video, and Disney+.
1.1 Background
Various research exists on this topic. For instance, JustWatch is an international streaming guide that helps over 20 million users per month find something to watch on Netflix, Prime Video, Disney+, and other streaming platforms. This search engine for digital media is available in 60 countries, and its data is based on the interest its users show in streaming services. Some analyses based on this platform have examined streaming catalogs, which have continuously shifted and changed over the years. This changing scenario is interesting because it shows how user preferences shift over time.
Additionally, several sources report the market shares of selected subscription video-on-demand (SVOD) services in the United States. Statista, a provider of market research and analysis services, reports that Netflix’s share of the U.S. SVOD market decreased from 29 percent in 2019 to 20 percent in 2020 due to new platforms like Peacock and HBO Max entering the market. However, Netflix still leads the video streaming world, followed by Amazon Prime with a 16 percent market share (Statista, 2021).
For this project, a data source that included all the features to be reviewed was chosen. The source of the dataset is Kaggle, from which it was extracted in CSV format. The data consists of 5368 observations and 11 variables (ID, Title, Year, Age, IMDb, Rotten Tomatoes, Netflix, Hulu, Prime Video, Disney+, Type). The dataset includes both numerical and categorical variables.
1.2 Description of the dataset
To keep the dataset manageable, only 9 of the 11 variables were used. Title is the name of the TV show; Year is the year the TV show was produced; Age is the target age group (7+, 13+, 16+, 18+, or all); IMDb is the IMDb rating of the TV show, on a scale out of 10; Rotten Tomatoes is the percentage of professional critic reviews that are positive for a given film or TV show, on a scale out of 100. The last four variables are the streaming platforms studied in this project (Netflix, Hulu, Prime Video, Disney+); each is a categorical variable coded 1 if the show is available on the platform and 0 otherwise.
The dataset consists of 5368 observations, with 3089 missing values in the variables Age (2127) and IMDb (962).
1.3 SMART questions
What are the most targeted age groups for the TV shows on Netflix, Hulu, Prime Video, and Disney+?
Which year published the highest number of TV shows?
Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?
What is the relationship between IMDb and Rotten Tomatoes?
2 Understanding the data
2.1 Dataset Summary
Importing the Dataset assigning “NA” to all blank cells
## 'data.frame': 5368 obs. of 9 variables:
## $ Title : chr "Breaking Bad" "Stranger Things" "Attack on Titan" "Better Call Saul" ...
## $ Year : int 2008 2016 2013 2015 2017 2005 2013 2010 2011 2020 ...
## $ Age : chr "18+" "16+" "18+" "18+" ...
## $ IMDb : num 9.4 8.7 9 8.8 8.8 9.3 8.8 8.2 8.8 8.6 ...
## $ Rotten.Tomatoes: num 100 96 95 94 93 93 93 93 92 92 ...
## $ Netflix : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Hulu : int 0 0 1 0 0 0 0 0 0 0 ...
## $ Prime.Video : int 0 0 0 0 0 1 0 0 0 0 ...
## $ Disney. : int 0 0 0 0 0 0 0 0 0 0 ...
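The import step can be sketched as follows. This is a minimal illustration, assuming the raw file is a CSV in which blank cells should become NA. The inline sample rows and the "/10" and "/100" suffixes are hypothetical stand-ins, not the actual Kaggle file.

```r
# Hypothetical two-row sample standing in for the Kaggle CSV (not the real file).
csv_text <- 'Title,Year,Age,IMDb,Rotten.Tomatoes,Netflix
"Show A",2008,18+,9.4/10,100/100,1
"Show B",2016,,,96/100,1'

# na.strings = c("", "NA") turns blank cells into NA while reading.
tv <- read.csv(text = csv_text, na.strings = c("", "NA"),
               stringsAsFactors = FALSE)

str(tv)
```

With a file on disk, the same call would take a path as its first argument instead of `text =`.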
2.2 Cleaning the Dataset
Dropped variables X1 and ID
Assigned “NA” to all blank cells (specifically NA for Age)
Replaced substrings with the gsub(pattern, replacement, x) function for the variables IMDb and Rotten Tomatoes
Converted the streaming platform variables to factors with as.factor()
Counted missing values for each variable
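The cleaning steps listed above could look roughly like the sketch below. The sample rows are hypothetical, and the assumption that ratings are stored as strings such as "9.4/10" and "100/100" is ours, inferred from the use of gsub().

```r
# Hypothetical raw rows; the "/10" and "/100" suffixes are an assumption
# about how the ratings are stored before cleaning.
tv <- data.frame(
  ID = 1:3,
  Title = c("Show A", "Show B", "Show C"),
  Age = c("18+", NA, "7+"),
  IMDb = c("9.4/10", NA, "7.1/10"),
  Rotten.Tomatoes = c("100/100", "96/100", "48/100"),
  Netflix = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Drop an identifier column that carries no information for the analysis.
tv$ID <- NULL

# gsub(pattern, replacement, x): strip the scale suffix, then convert to numeric.
tv$IMDb <- as.numeric(gsub("/10$", "", tv$IMDb))
tv$Rotten.Tomatoes <- as.numeric(gsub("/100$", "", tv$Rotten.Tomatoes))

# Platform indicators become factors with levels "0" and "1".
tv$Netflix <- as.factor(tv$Netflix)

# Count missing values per column.
colSums(is.na(tv))
```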
2.3 Understanding a little more about the streaming platform variables and age
2.3.1 Age:
##
## 13+ 16+ 18+ 7+ all
## 9 995 854 831 552
2.3.2 Netflix:
##
## 0 1
## 3397 1971
2.3.3 Hulu:
##
## 0 1
## 3747 1621
2.3.4 Prime Video:
##
## 0 1
## 3537 1831
2.3.5 Disney:
##
## 0 1
## 5017 351
3 Exploratory Data Analysis
3.1 Smart Question: What are the most targeted age groups for the TV shows on Netflix, Hulu, Prime Video, and Disney+?
Viewers aged 16 and older (16+) are the most targeted age group for TV shows across all the streaming platforms combined.
3.2 Smart Question: Which year published the highest number of TV shows?
The highest numbers of TV shows were published in 2017 (685) and 2018 (562). The histogram is left-skewed, with a long tail toward earlier years, which indicates that the number of published shows has risen over time.
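Counting shows per year can be sketched with table(). The years below are made up for illustration and are not the project data.

```r
# Hypothetical release years (not the project data).
years <- c(2015, 2017, 2017, 2018, 2017, 2018, 2016)

# table() tallies titles per year; sort() puts the busiest years first.
per_year <- sort(table(years), decreasing = TRUE)
head(per_year, 2)

# The name of the first entry is the year with the most titles.
top_year <- as.integer(names(per_year)[1])
top_year
```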
3.3 Normality check for IMDb and Rotten Tomatoes
We have found the average IMDb and Rotten Tomatoes ratings. Next, we check whether the samples of these two variables are normally distributed. If a variable is normally distributed, its mean and median will be approximately equal.
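As a rough illustration of the check, the sketch below runs shapiro.test() on simulated, deliberately skewed data (an assumption for demonstration, not the project data). With samples of a few thousand observations the test can reject normality even for small departures, so it is worth reading the result alongside a histogram.

```r
set.seed(1)
# Simulated, clearly skewed sample (an assumption for illustration only).
skewed <- rexp(500)

# shapiro.test() accepts between 3 and 5000 non-missing values.
res <- shapiro.test(skewed)
res$p.value < 0.05   # TRUE here: normality is rejected

# Mean and median drift apart when the distribution is skewed.
c(mean = mean(skewed), median = median(skewed))
```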
3.3.1 Normality check for the variables IMDb and Rotten Tomatoes for Netflix
##
## Shapiro-Wilk normality test
##
## data: netflixtv$IMDb
## W = 0.9, p-value <0.0000000000000002
##
## Shapiro-Wilk normality test
##
## data: netflixtv$Rotten.Tomatoes
## W = 1, p-value = 0.000000009
Here the p-values for IMDb and Rotten.Tomatoes are both less than 0.05 for Netflix. The histogram of IMDb is right-skewed and the histogram of Rotten.Tomatoes is slightly left-skewed, so the mean and median are not equal. We therefore conclude that IMDb and Rotten Tomatoes ratings are not normally distributed for the Netflix platform.
3.3.2 Normality check for the variables IMDb and Rotten Tomatoes for Hulu
##
## Shapiro-Wilk normality test
##
## data: hulutv$IMDb
## W = 0.9, p-value <0.0000000000000002
##
## Shapiro-Wilk normality test
##
## data: hulutv$Rotten.Tomatoes
## W = 1, p-value = 0.0000000000000006
Here the p-values for IMDb and Rotten.Tomatoes are both less than 0.05 for Hulu. The histogram of IMDb is right-skewed and the histogram of Rotten.Tomatoes is slightly bimodal, so the mean and median are not equal. We therefore conclude that IMDb and Rotten Tomatoes ratings are not normally distributed for the Hulu platform.
3.3.3 Normality check for the variables IMDb and Rotten Tomatoes for Prime Video
##
## Shapiro-Wilk normality test
##
## data: primetv$IMDb
## W = 0.9, p-value <0.0000000000000002
##
## Shapiro-Wilk normality test
##
## data: primetv$Rotten.Tomatoes
## W = 0.9, p-value <0.0000000000000002
Here the p-values for IMDb and Rotten.Tomatoes are both less than 0.05 for Prime Video. The histogram of IMDb is right-skewed and the histogram of Rotten.Tomatoes is irregular, with no clear shape. We therefore conclude that IMDb and Rotten Tomatoes ratings are not normally distributed for the Prime Video platform.
3.3.4 Normality check for the variables IMDb and Rotten Tomatoes for Disney+
##
## Shapiro-Wilk normality test
##
## data: disneytv$IMDb
## W = 1, p-value = 0.005
##
## Shapiro-Wilk normality test
##
## data: disneytv$Rotten.Tomatoes
## W = 1, p-value = 0.00009
Here the p-values for IMDb and Rotten.Tomatoes are both less than 0.05 for Disney+. The histogram of IMDb is slightly right-skewed and irregular, and the histogram of Rotten.Tomatoes is slightly left-skewed and irregular. We therefore conclude that IMDb and Rotten Tomatoes ratings are not normally distributed for the Disney+ platform.
3.4 SMART Questions
After an initial examination of our chosen data set, we decided on three SMART questions to focus on:
What are the most targeted age groups for the TV shows on Netflix, Hulu, Prime Video, and Disney+?
Which year published the highest number of TV shows?
Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?
The first question focuses on the relation between the target age group and the platform. By looking at the distribution of the column Age for each platform, we can obtain knowledge about the intended audience for Netflix, Hulu, Prime, and Disney. The second question focuses on the column Year. We quickly noticed that the range of years listed was larger than expected, spanning from 1904 to 2021. We expected that more recent years would have more listed TV shows, but we wanted to explore that distribution in more detail. Finally, the third question looks at the IMDb and Rotten Tomatoes ratings for each platform. While there were some TV shows that were available on more than one platform, we were interested in seeing how the overall ratings were distributed for each platform.
During our exploratory data analysis, we also came up with a fourth SMART question:
What is the relationship between IMDb and Rotten Tomatoes?
We had not initially considered the idea that the IMDb and Rotten Tomatoes rating systems would differ by much, but more in-depth analysis revealed that there were significant differences between how the two systems rated shows. This created an additional dimension of analysis that we did not initially anticipate.
4 Exploratory Data Analysis
4.1 Rating Systems
4.1.1 Frequency of Ratings
First, we wanted to use exploratory data analysis (EDA) to learn more about the two rating systems: IMDb and Rotten Tomatoes.
## [1] "Mean IMDb rating:"
## [1] 7.09
## [1] "Median IMDb rating:"
## [1] 7.3
## [1] "Mode IMDb rating:"
## [1] 7.4
## [1] "Mean RT rating:"
## [1] 47.2
## [1] "Median RT rating:"
## [1] 48
## [1] "Mode RT rating:"
## [1] 10
| | IMDb rating | RT rating |
|---|---|---|
| Mean | 7.086 | 47.2 |
| Median | 7.3 | 48 |
| Mode | 7.4 | 10 |
IMDb uses a rating scale from 1 to 10, but the lowest rating in this dataset is 1.1 and the highest is 9.6. Rotten Tomatoes rates TV shows on a 100-point scale; here the lowest score is 10 and the highest is 100. As seen in the histogram, the distribution of IMDb ratings has a slight left skew, which is also reflected in the median (blue) being larger than the mean (red). The Rotten Tomatoes mean and median are almost equal, but there are some outliers in the data at the lower end of the scale. The mode is 10/100, with 304 shows receiving that rating. The Rotten Tomatoes rating is a combination of critics’ ratings and audience ratings, but this dataset only shows the total rating, which is a limitation. It would have been interesting to see how critics and the audience agree or disagree about certain ratings.
Comparing the two distributions, it became clear that IMDb, on average, gives higher ratings than Rotten Tomatoes. IMDb has a mean of 7.09/10 (70.9%) and a median of 7.3/10 (73%), while Rotten Tomatoes has a mean of 47.2/100 (47.2%) and a median of 48/100 (48%). This discrepancy was surprising since we expected the rating systems to generally agree.
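The three summary statistics can be computed as in the sketch below. Base R has no built-in function for the statistical mode, so `stat_mode` is a hypothetical helper written for this illustration, and the ratings are made up.

```r
# Base R has no statistical mode function; stat_mode is a hypothetical helper.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  counts <- table(x)
  # Return the most frequent value (the first one in case of ties).
  as.numeric(names(counts)[which.max(counts)])
}

# Hypothetical ratings, not the project data.
imdb <- c(7.4, 7.4, 6.1, 8.0, NA, 7.3)

c(mean = mean(imdb, na.rm = TRUE),
  median = median(imdb, na.rm = TRUE),
  mode = stat_mode(imdb))
```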
4.1.2 Comparison of Rating Systems
In order to further explore this unexpected discrepancy, we created a scatter plot comparing IMDb and Rotten Tomatoes. We also added the dimension of Age, which is the intended age group of each show. This allows us to visualize how shows for different age groups are rated.
## [1] "Mean IMDb rating, 7+:"
## [1] 7.01
## [1] "Mean RT rating, 7+:"
## [1] 55
## [1] "Mean IMDb rating, 13+:"
## [1] 6.83
## [1] "Mean RT rating, 13+:"
## [1] 54.2
## [1] "Mean IMDb rating, 16+:"
## [1] 7.25
## [1] "Mean RT rating, 16+:"
## [1] 60.3
## [1] "Mean IMDb rating, 18+:"
## [1] 7.3
## [1] "Mean RT rating, 18+:"
## [1] 62.7
## [1] "Mean IMDb rating, all:"
## [1] 6.85
## [1] "Mean RT rating, all:"
## [1] 47.7
## [1] "Mean IMDb rating, NA:"
## [1] 6.96
## [1] "Mean RT rating, NA:"
## [1] 31.7
This scatter plot further confirmed the fact that IMDb had higher ratings when compared to Rotten Tomatoes. There is a positive correlation between the two rating systems, but Rotten Tomatoes consistently has a lower overall rating. This helped us to form a new SMART question: what is the relationship between IMDb and Rotten Tomatoes?
This plot also gave us some information about how shows for different age groups are rated. Shows intended for 18+ had the highest overall rating, with an average of 7.3/10 on IMDb and 62.7/100 on Rotten Tomatoes. The 13+ age group had the lowest average IMDb score of 6.83/10, and shows intended for all ages had the lowest Rotten Tomatoes score among the labeled groups with 47.7/100. It should be noted that shows with no intended age group listed (NA) had an even lower Rotten Tomatoes score of 31.7/100, but since that subset lacks age group data, it was excluded from later analysis.
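Group means like the ones printed above can be obtained with aggregate(). The sketch uses hypothetical rows. Note that the formula interface drops rows with a missing Age by default, so an NA group would need to be labeled explicitly to appear in the result.

```r
# Hypothetical rows, not the project data.
tv <- data.frame(
  Age = c("16+", "18+", "18+", "all", NA, "16+"),
  IMDb = c(7.0, 7.5, 7.1, 6.8, 6.9, 7.4),
  Rotten.Tomatoes = c(60, 65, 61, 48, 30, 62)
)

# aggregate() with a formula drops rows whose Age is NA by default.
by_age <- aggregate(cbind(IMDb, Rotten.Tomatoes) ~ Age, data = tv, FUN = mean)
by_age
```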
4.2 Streaming Platforms
4.2.1 Ratings for Platforms
After comparing the rating systems to each other and then seeing how the intended age group affects ratings, we then turned to the different platforms. We compared the ratings for Disney, Hulu, Netflix, and Prime using boxplots.
## [1] "Mean IMDb rating, Disney:"
## [1] 6.97
## [1] "Mean RT rating, Disney:"
## [1] 49.4
## [1] "Mean IMDb rating, Hulu:"
## [1] 7.08
## [1] "Mean RT rating, Hulu:"
## [1] 52.8
## [1] "Mean IMDb rating, Netflix:"
## [1] 7.11
## [1] "Mean RT rating, Netflix:"
## [1] 53.6
## [1] "Mean IMDb rating, Prime:"
## [1] 7.15
## [1] "Mean RT rating, Prime:"
## [1] 37.8
Much like how the rating systems had different overall distributions, they also painted different pictures for the average rating of shows on each platform. Using IMDb, Prime has the highest average rating of 7.15/10 and Disney has the lowest with 6.97/10. Looking at the Rotten Tomatoes ratings, however, Netflix has the highest average rating of 53.6/100, Hulu has the highest median score of 55/100, and Prime has the lowest average rating of 37.8/100. Prime is particularly interesting in this respect since it has the highest average rating with IMDb, but the lowest average rating with Rotten Tomatoes. This demonstrates that the question of “which platform has the highest average rating?” is not so straightforward.
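Per-platform averages can be computed directly from the 0/1 indicator columns, as in this sketch with hypothetical rows. A show available on several platforms contributes to each of them, so the groups overlap.

```r
# Hypothetical rows, not the project data.
tv <- data.frame(
  IMDb = c(9.4, 8.7, 7.0, 6.5),
  Netflix = c(1, 1, 0, 0),
  Hulu = c(0, 1, 1, 0),
  Prime.Video = c(0, 0, 1, 1)
)
platforms <- c("Netflix", "Hulu", "Prime.Video")

# For each platform, average the ratings of the shows flagged with 1.
platform_means <- sapply(platforms, function(p) {
  mean(tv$IMDb[tv[[p]] == 1], na.rm = TRUE)
})
round(platform_means, 2)
```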
4.2.2 Age Groups by Platform
Next, we considered the relationship between platform and target age group. The frequency of target age groups differed significantly between different streaming platforms, thus making this a relevant feature to consider in later analysis.
## [1] 0.368
## [1] 0.309
## [1] 0.245
## [1] 0.116
Starting with Disney, the most frequent targeted age group was all ages, with 36.8% of the Disney TV shows listed falling into that category. For Hulu, 16+ was the most common age group at 30.9%. The most common age group for Netflix was 18+, comprising 24.5% of its TV shows. Prime was more evenly distributed amongst age groups (apart from 13+, which was very low), but 7+ was the most common age group at 11.6%.
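The age-group shares for one platform can be sketched with prop.table(). The rows are hypothetical, and keeping shows without an age label in the denominator is an assumption on our part about how the percentages above were computed.

```r
# Hypothetical rows, not the project data.
tv <- data.frame(
  Age = c("all", "all", "7+", NA, "16+", "all"),
  Disney = c(1, 1, 1, 1, 0, 0)
)

on_disney <- tv$Age[tv$Disney == 1]

# useNA = "ifany" keeps shows without an age label in the denominator.
shares <- prop.table(table(on_disney, useNA = "ifany"))
round(shares, 3)
```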
After seeing how much each streaming platform differed when it comes to age group, it would have been interesting to explore other demographics for the audience of each platform. Relevant data in this respect could include gender, race, the number of views, and the actual age of the viewer (as opposed to the target age group). Since this dataset did not include these features, this can be considered another limitation of the data. Our main focus was the TV show ratings, but we could have learned more about user preference and built a more detailed model with that additional information.
##
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes * Platform, data = plot.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.513 -0.491 0.085 0.623 3.096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.33532 0.23088 23.11 < 0.0000000000000002 ***
## Rotten.Tomatoes 0.03118 0.00427 7.29 0.00000000000035 ***
## PlatformHulu -0.79742 0.25511 -3.13 0.0018 **
## PlatformNetflix -0.32703 0.24834 -1.32 0.1879
## PlatformPrime 0.15766 0.25105 0.63 0.5301
## Rotten.Tomatoes:PlatformHulu 0.01290 0.00465 2.77 0.0055 **
## Rotten.Tomatoes:PlatformNetflix 0.00703 0.00457 1.54 0.1240
## Rotten.Tomatoes:PlatformPrime 0.00189 0.00467 0.40 0.6865
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.974 on 4782 degrees of freedom
## (984 observations deleted due to missingness)
## Multiple R-squared: 0.241, Adjusted R-squared: 0.24
## F-statistic: 217 on 7 and 4782 DF, p-value: <0.0000000000000002
##
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes + Platform, data = plot.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.514 -0.493 0.097 0.632 2.937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.970519 0.075666 65.69 < 0.0000000000000002 ***
## Rotten.Tomatoes 0.038133 0.000991 38.49 < 0.0000000000000002 ***
## PlatformHulu -0.089512 0.061014 -1.47 0.14
## PlatformNetflix 0.041857 0.059483 0.70 0.48
## PlatformPrime 0.268076 0.061925 4.33 0.000015 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.976 on 4785 degrees of freedom
## (984 observations deleted due to missingness)
## Multiple R-squared: 0.238, Adjusted R-squared: 0.237
## F-statistic: 373 on 4 and 4785 DF, p-value: <0.0000000000000002
##
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = plot.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.549 -0.487 0.088 0.623 2.760
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.11965 0.05524 92.7 <0.0000000000000002 ***
## Rotten.Tomatoes 0.03642 0.00098 37.2 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.984 on 4788 degrees of freedom
## (984 observations deleted due to missingness)
## Multiple R-squared: 0.224, Adjusted R-squared: 0.224
## F-statistic: 1.38e+03 on 1 and 4788 DF, p-value: <0.0000000000000002
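To read the interaction model (the first of the three outputs above), the slope of IMDb on Rotten.Tomatoes for each platform is the baseline slope plus that platform's interaction term. The sketch below does this arithmetic with the printed coefficients. Treating Disney as the baseline level is an assumption, inferred from it being the only platform without its own coefficient.

```r
# Coefficients copied from the output of IMDb ~ Rotten.Tomatoes * Platform;
# the baseline level is assumed to be Disney.
base_slope <- 0.03118
slope_shift <- c(Disney = 0, Hulu = 0.01290, Netflix = 0.00703, Prime = 0.00189)

# Per-platform slope = baseline slope + interaction term for that platform.
slopes <- base_slope + slope_shift
round(slopes, 5)
```

Only the Hulu interaction term is significant in that output (p = 0.0055), so the slopes for Netflix and Prime are not distinguishable from the baseline slope at the 0.05 level.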
5 Distributions and tests
5.1 Smart Question: Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?
The dataset covers four streaming platforms: Netflix, Hulu, Prime Video, and Disney+. To find the highest average rating according to Rotten Tomatoes and IMDb among the four platforms, we ran a t-test for each platform. A t-test is an inferential statistic used to compare the means of two groups. Here each test compares IMDb against Rotten Tomatoes within a single platform, and since the two ratings are on different scales (out of 10 and out of 100), the t statistic and p-value are not informative on their own. We use only the sample means reported in the output and compare them across platforms to find the highest average rating.
##
## Welch Two Sample t-test
##
## data: netflixtv$IMDb and netflixtv$Rotten.Tomatoes
## t = -134, df = 1990, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -47.1 -45.8
## sample estimates:
## mean of x mean of y
## 7.11 53.56
##
## Welch Two Sample t-test
##
## data: hulutv$IMDb and hulutv$Rotten.Tomatoes
## t = -98, df = 1635, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -46.7 -44.8
## sample estimates:
## mean of x mean of y
## 7.08 52.84
##
## Welch Two Sample t-test
##
## data: primetv$IMDb and primetv$Rotten.Tomatoes
## t = -62, df = 1846, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31.6 -29.6
## sample estimates:
## mean of x mean of y
## 7.15 37.76
##
## Welch Two Sample t-test
##
## data: disneytv$IMDb and disneytv$Rotten.Tomatoes
## t = -51, df = 354, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -44.1 -40.8
## sample estimates:
## mean of x mean of y
## 6.97 49.42
Prime Video has the highest average IMDb rating among all the streaming platforms, at 7.152538. Netflix has the highest average Rotten Tomatoes rating among all the streaming platforms, at 53.559107. This is how the highest average rating is found.
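The tests above compare IMDb with Rotten Tomatoes inside one platform. A test that speaks more directly to the question would compare the same rating between two platforms, as in the sketch below. The data are simulated around the reported means purely for illustration. In the real dataset some shows appear on several platforms, so the two groups would not be fully independent.

```r
set.seed(42)
# Simulated ratings on the same 0-10 scale (not the project data).
prime_imdb  <- rnorm(200, mean = 7.15, sd = 1)
disney_imdb <- rnorm(200, mean = 6.97, sd = 1)

# t.test() defaults to the Welch test (var.equal = FALSE).
res <- t.test(prime_imdb, disney_imdb)
res$estimate   # the two sample means
res$p.value    # evidence against equal means
```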
5.1.1 SMART Question: Do the IMDb and Rotten Tomatoes ratings depend on the Age variable?
We want to check whether the IMDb and Rotten Tomatoes ratings of shows on each streaming platform are related to the target age group. We conduct a chi-square test of independence between age and each rating.
H0: Age and rating are independent of each other.
H1: Age and rating are not independent of each other.
5.2 Independence check for Netflix platform
##
## 1.1 2.5 2.7 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4 4.1 4.2 4.3 4.4 4.5 4.6
## 16+ 1 0 0 0 0 1 2 0 0 0 0 0 0 0 0 1 1 2
## 18+ 0 1 1 0 0 0 0 1 0 1 0 0 1 0 0 0 0 4
## 7+ 0 0 0 0 0 0 0 1 1 1 1 0 1 2 2 0 1 0
## all 0 0 0 1 2 0 0 0 1 1 1 1 0 1 0 1 1 1
##
## 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.5
## 16+ 0 1 0 0 0 1 1 4 1 0 0 2 1 4 3 8 3 5 6
## 18+ 1 1 1 5 1 0 3 5 3 6 4 5 8 5 6 15 7 17 16
## 7+ 0 1 3 1 0 3 1 1 6 1 7 4 4 8 3 4 5 5 10
## all 0 1 5 2 1 2 2 4 1 2 4 1 5 4 1 8 3 6 5
##
## 6.6 6.7 6.8 6.9 7 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4
## 16+ 8 13 12 11 7 17 21 13 29 21 14 24 17 13 18 19 11 16 13
## 18+ 14 13 21 16 13 17 23 17 15 23 22 16 19 27 19 9 14 14 15
## 7+ 9 11 10 10 15 10 12 11 13 15 12 5 8 10 12 12 7 9 10
## all 8 7 10 7 2 6 7 4 8 5 3 5 4 2 4 4 2 2 2
##
## 8.5 8.6 8.7 8.8 8.9 9 9.1 9.3 9.4
## 16+ 10 6 8 5 0 4 2 0 0
## 18+ 9 11 6 6 1 1 1 0 1
## 7+ 6 4 2 4 0 0 1 1 0
## all 3 3 2 0 0 0 0 1 0
##
## Pearson's Chi-squared test
##
## data: contable1
## X-squared = 259, df = 192, p-value = 0.0009
##
## 16 21 22 23 27 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
## 16+ 0 0 1 1 0 0 1 2 0 0 2 2 0 3 1 2 3 3 4 3 7 7 3 6
## 18+ 0 0 1 0 0 0 1 1 0 0 1 2 2 0 2 4 11 1 6 6 9 7 6 11
## 7+ 1 0 0 0 1 1 2 0 1 2 1 2 6 4 7 7 10 5 7 9 8 12 6 6
## all 1 2 0 0 0 1 0 4 1 2 4 5 5 4 4 18 10 5 8 6 12 9 7 5
##
## 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
## 16+ 9 4 8 7 12 10 14 17 8 11 6 12 13 6 17 9 7 15 10 9 10 7 8 8
## 18+ 9 9 13 7 16 16 9 7 19 12 11 9 19 22 8 15 13 14 10 17 8 7 11 10
## 7+ 7 8 10 12 12 8 8 12 6 8 4 5 7 8 4 9 7 7 4 7 3 8 4 3
## all 6 5 5 5 6 8 2 2 2 4 1 2 0 1 3 3 1 1 0 1 0 0 0 1
##
## 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 92 93 94 95 96
## 16+ 4 10 8 5 4 9 5 3 8 6 3 4 6 1 7 3 5 2 1 0 1 0 0 1
## 18+ 14 9 9 3 6 9 4 8 4 4 7 6 4 7 7 2 2 3 5 2 2 1 1 0
## 7+ 3 3 3 1 1 5 1 2 2 1 2 2 0 1 1 0 0 0 1 0 1 0 0 0
## all 0 0 0 2 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## 100
## 16+ 0
## 18+ 1
## 7+ 0
## all 0
##
## Pearson's Chi-squared test
##
## data: contable2
## X-squared = 448, df = 216, p-value <0.0000000000000002
Since the p-value for IMDb is 0.0009 and the p-value for Rotten Tomatoes is below 0.0000000000000002, both lower than 0.05, we reject the null hypothesis. Thus, neither the IMDb nor the Rotten Tomatoes ratings are independent of age for Netflix: age and rating are associated on the Netflix platform.
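The contingency tables above have one column per distinct rating value, which leaves many cells with counts of 0 or 1, and the chi-square approximation is less reliable in that situation. One possible alternative, sketched here on simulated data with arbitrarily chosen band cut-offs (both are assumptions, not the project's method), is to bin the ratings before testing.

```r
set.seed(7)
# Simulated data (not the project data): four age groups and IMDb-like scores.
age  <- factor(sample(c("7+", "16+", "18+", "all"), 400, replace = TRUE))
imdb <- round(pmin(pmax(rnorm(400, mean = 7, sd = 1), 1), 10), 1)

# Bin the ratings so that each cell of the table has a reasonable count.
rating_band <- cut(imdb, breaks = c(0, 6.5, 7.5, 10),
                   labels = c("low", "mid", "high"))

contable <- table(age, rating_band)
res <- chisq.test(contable)
res$parameter        # degrees of freedom: (4 - 1) * (3 - 1) = 6
min(res$expected)    # expected counts should ideally be at least 5
```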
5.3 Independence check for Hulu platform
##
## Pearson's Chi-squared test
##
## data: contable3
## X-squared = 230, df = 207, p-value = 0.1
##
## Pearson's Chi-squared test
##
## data: contable4
## X-squared = 277, df = 210, p-value = 0.001
Since the p-value for IMDb is 0.1, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Hulu. For Rotten Tomatoes, the p-value is 0.001, which is lower than 0.05, so we reject the null hypothesis. Thus, the Rotten Tomatoes rating is not independent of age: age and Rotten Tomatoes ratings are associated on the Hulu platform.
5.4 Independence check for Prime Video platform
##
## Pearson's Chi-squared test
##
## data: contable5
## X-squared = 186, df = 171, p-value = 0.2
##
## Pearson's Chi-squared test
##
## data: contable6
## X-squared = 320, df = 201, p-value = 0.0000002
Since the p-value for IMDb is 0.2, which is greater than 0.05, we fail to reject the null hypothesis. Thus, there is no evidence that age and IMDb rating are associated for Prime Video. For Rotten Tomatoes, the p-value is 0.0000002, which is lower than 0.05, so we reject the null hypothesis. Thus, the Rotten Tomatoes rating is not independent of age: age and Rotten Tomatoes ratings are associated on the Prime Video platform.
5.5 Independence check for Disney+ platform
##
## Pearson's Chi-squared test
##
## data: contable7
## X-squared = 246, df = 156, p-value = 0.000006
##
## Pearson's Chi-squared test
##
## data: contable8
## X-squared = 227, df = 180, p-value = 0.01
Since the p-value for IMDb is 0.000006 and the p-value for Rotten Tomatoes is 0.01, both lower than 0.05, we reject the null hypothesis. Thus, neither rating is independent of age for Disney+: age and rating are associated on the Disney+ platform.
This step cleans the data, converting some string variables to numeric and others to factors.
## 'data.frame': 5368 obs. of 12 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Title : chr "Breaking Bad" "Stranger Things" "Attack on Titan" "Better Call Saul" ...
## $ Year : int 2008 2016 2013 2015 2017 2005 2013 2010 2011 2020 ...
## $ Age : Factor w/ 5 levels "13+","16+","18+",..: 3 2 3 3 2 4 3 3 3 3 ...
## $ IMDb : num 9.4 8.7 9 8.8 8.8 9.3 8.8 8.2 8.8 8.6 ...
## $ Rotten.Tomatoes: num 100 96 95 94 93 93 93 93 92 92 ...
## $ Netflix : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ Hulu : Factor w/ 2 levels "0","1": 1 1 2 1 1 1 1 1 1 1 ...
## $ Prime.Video : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 1 1 1 ...
## $ Disney. : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ Type : int 1 1 1 1 1 1 1 1 1 1 ...
6 Multiple Linear Regression Model
6.1 SMART Question: What is the relationship between IMDb and Rotten Tomatoes?
6.1.1 Preparation work
6.1.1.1 Drop NaN data
To analyze the relationship between the scores of the two rating systems, we drop the rows that are missing an IMDb score or a Rotten.Tomatoes score. The columns X, ID, Title, and Type are not useful for the model, so we drop them as well.
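A minimal sketch of this preparation step, using hypothetical rows. The column names follow the str() output above, and everything else is illustrative.

```r
# Hypothetical rows, not the project data.
tv <- data.frame(
  X = 0:3, ID = 1:4,
  Title = c("A", "B", "C", "D"),
  IMDb = c(9.4, NA, 7.1, 6.0),
  Rotten.Tomatoes = c(100, 96, NA, 55),
  Type = 1L
)

# Keep rows where both scores are present.
has_scores <- !is.na(tv$IMDb) & !is.na(tv$Rotten.Tomatoes)

# Drop identifier-like columns that are not used as predictors.
drop_cols <- c("X", "ID", "Title", "Type")
tvdata_rank <- tv[has_scores, !(names(tv) %in% drop_cols)]
tvdata_rank
```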
6.1.1.2 Make a pairs() plot with all the variables (quantitative and qualitative)
6.1.1.3 Make a corrplot() with only the numerical variables
6.1.2 Linear regression model build
6.1.2.1 Simple linear model
Using only the variable Rotten.Tomatoes, we build a linear model with one independent variable to predict IMDb.
##
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = tvdata_rank)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.562 -0.495 0.087 0.628 2.746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.15419 0.05785 89.1 <0.0000000000000002 ***
## Rotten.Tomatoes 0.03590 0.00104 34.6 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.989 on 4404 degrees of freedom
## Multiple R-squared: 0.213, Adjusted R-squared: 0.213
## F-statistic: 1.19e+03 on 1 and 4404 DF, p-value: <0.0000000000000002
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 5.1542 | 0.0579 | 89.1 | 0 |
| Rotten.Tomatoes | 0.0359 | 0.0010 | 34.6 | 0 |
From the results above, we find only a modest relationship between IMDb and Rotten.Tomatoes: the R-squared is 0.213, meaning Rotten.Tomatoes explains about 21% of the variance in IMDb (a correlation coefficient of about 0.46).
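The link between the R-squared reported by lm() and the correlation coefficient can be checked with a short sketch. The data are simulated (an assumption for illustration), and the last line applies the same relation to the R-squared of 0.213 from the output above.

```r
set.seed(3)
# Simulated scores (not the project data).
rt   <- runif(300, min = 10, max = 100)
imdb <- 5 + 0.036 * rt + rnorm(300, sd = 1)

fit <- lm(imdb ~ rt)
r   <- cor(imdb, rt)

# With a single predictor, R-squared equals the squared Pearson correlation.
c(r = r, r_squared = r^2, lm_r_squared = summary(fit)$r.squared)

# Applied to the output above: sqrt(0.213) is about 0.46.
sqrt(0.213)
```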
6.1.2.2 Adding more variables
Because the simple model explains little of the variance, we add the other variables to the model. Below is the result:
##
## Call:
## lm(formula = IMDb ~ ., data = tvdata_rank)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.361 -0.465 0.080 0.586 2.579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.68924 3.61092 3.79 0.00015 ***
## Year -0.00449 0.00178 -2.52 0.01194 *
## Age16+ 0.14321 0.30940 0.46 0.64349
## Age18+ 0.07727 0.31010 0.25 0.80323
## Age7+ 0.11428 0.30955 0.37 0.71201
## Ageall 0.22263 0.31096 0.72 0.47409
## Rotten.Tomatoes 0.04290 0.00131 32.72 < 0.0000000000000002 ***
## Netflix1 -0.10325 0.05598 -1.84 0.06521 .
## Hulu1 -0.24195 0.05480 -4.42 0.00001 ***
## Prime.Video1 0.09515 0.05565 1.71 0.08740 .
## Disney.1 -0.24240 0.07845 -3.09 0.00202 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.923 on 3196 degrees of freedom
## (1199 observations deleted due to missingness)
## Multiple R-squared: 0.285, Adjusted R-squared: 0.283
## F-statistic: 128 on 10 and 3196 DF, p-value: <0.0000000000000002
The adjusted R-squared increases (from 0.213 to 0.283), but some variables are not significant, so we try dropping them.
6.1.2.3 Drop sparse age variables
##
## 13+ 16+ 18+ 7+ all
## 9 987 852 824 535
The table shows that there are only 9 shows in the 13+ age group. Such a small sample leads to imprecise estimates, so we drop the Age level 13+. We also drop Netflix and Prime.Video, whose coefficients are not significant at the 0.05 level, along with Disney. (note that Disney. was significant in the full model, with p = 0.002).
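Removing a sparse factor level can be sketched as below, with hypothetical rows. After filtering, droplevels() is needed so that the empty level does not remain in the factor and show up in later tables or models.

```r
# Hypothetical rows, not the project data.
tv <- data.frame(
  Age = factor(c("13+", "16+", "18+", "16+", "7+", "all")),
  IMDb = c(6.8, 7.2, 7.3, 7.4, 7.0, 6.9)
)

# Remove the sparse level, then drop it from the factor's level set
# so it does not linger as an empty category in later models.
tv_no13 <- droplevels(subset(tv, Age != "13+"))

table(tv_no13$Age)
```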
6.1.2.4 Linear model with three variables
We then fit the third model as follows.
##
## Call:
## lm(formula = IMDb ~ ., data = tvdata_rank_no13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.417 -0.467 0.087 0.585 2.756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.52238 3.47216 5.33 0.000000102 ***
## Year -0.00681 0.00172 -3.97 0.000074207 ***
## Age18+ -0.06490 0.04422 -1.47 0.14
## Age7+ -0.05724 0.04503 -1.27 0.20
## Ageall 0.03145 0.05450 0.58 0.56
## Rotten.Tomatoes 0.04195 0.00130 32.30 < 0.0000000000000002 ***
## Hulu1 -0.20024 0.03555 -5.63 0.000000019 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.927 on 3191 degrees of freedom
## Multiple R-squared: 0.277, Adjusted R-squared: 0.276
## F-statistic: 204 on 6 and 3191 DF, p-value: <0.0000000000000002
Year, Rotten.Tomatoes, and Hulu are significant, while the Age levels are not. The adjusted R-squared is 0.276, higher than that of the simple linear regression (0.213).
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18.5224 | 3.4722 | 5.335 | 0.0000 |
| Year | -0.0068 | 0.0017 | -3.968 | 0.0001 |
| Age18+ | -0.0649 | 0.0442 | -1.468 | 0.1423 |
| Age7+ | -0.0572 | 0.0450 | -1.271 | 0.2038 |
| Ageall | 0.0315 | 0.0545 | 0.577 | 0.5639 |
| Rotten.Tomatoes | 0.0420 | 0.0013 | 32.300 | 0.0000 |
| Hulu1 | -0.2002 | 0.0355 | -5.633 | 0.0000 |
| Age18+ | Age7+ | Ageall | Hulu1 | Rotten.Tomatoes | Year |
|---|---|---|---|---|---|
| 1.17 | 1.42 | 1.44 | 1.54 | 1.2 | 1.1 |
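Comparing this model with the simple linear regression using anova() raises an error, “models were not all fitted to the same size of dataset”, because the simple model was fitted on tvdata_rank rather than tvdata_rank_no13. We therefore refit the simple linear regression on tvdata_rank_no13 as model4.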
6.1.2.5 Linear regression model rebuild
##
## Call:
## lm(formula = IMDb ~ Rotten.Tomatoes, data = tvdata_rank_no13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.396 -0.459 0.075 0.581 2.811
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.7863 0.0709 67.5 <0.0000000000000002 ***
## Rotten.Tomatoes 0.0407 0.0012 34.0 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.934 on 3196 degrees of freedom
## Multiple R-squared: 0.266, Adjusted R-squared: 0.266
## F-statistic: 1.16e+03 on 1 and 3196 DF, p-value: <0.0000000000000002
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 4.7863 | 0.0709 | 67.5 | 0 |
| Rotten.Tomatoes | 0.0407 | 0.0012 | 34.0 | 0 |
The concepts above can be extended naturally to models with interactions between numeric and factor variables.
##
## Call:
## lm(formula = IMDb ~ . + Rotten.Tomatoes:Age, data = tvdata_rank_no13)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.460 -0.457 0.069 0.579 2.788
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.68621 3.51095 5.89 0.0000000042 ***
## Year -0.00781 0.00174 -4.49 0.0000073150 ***
## Age18+ -0.45763 0.20041 -2.28 0.0225 *
## Age7+ -0.52436 0.20025 -2.62 0.0089 **
## Ageall 0.58862 0.23340 2.52 0.0117 *
## Rotten.Tomatoes 0.03930 0.00218 18.03 < 0.0000000000000002 ***
## Hulu1 -0.20345 0.03549 -5.73 0.0000000108 ***
## Age18+:Rotten.Tomatoes 0.00639 0.00317 2.01 0.0440 *
## Age7+:Rotten.Tomatoes 0.00814 0.00340 2.39 0.0168 *
## Ageall:Rotten.Tomatoes -0.01237 0.00445 -2.78 0.0055 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.924 on 3188 degrees of freedom
## Multiple R-squared: 0.283, Adjusted R-squared: 0.281
## F-statistic: 140 on 9 and 3188 DF, p-value: <0.0000000000000002
| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 20.6862 | 3.5110 | 5.89 | 0.0000 |
| Year | -0.0078 | 0.0017 | -4.49 | 0.0000 |
| Age18+ | -0.4576 | 0.2004 | -2.28 | 0.0225 |
| Age7+ | -0.5244 | 0.2003 | -2.62 | 0.0089 |
| Ageall | 0.5886 | 0.2334 | 2.52 | 0.0117 |
| Rotten.Tomatoes | 0.0393 | 0.0022 | 18.03 | 0.0000 |
| Hulu1 | -0.2034 | 0.0355 | -5.73 | 0.0000 |
| Age18+:Rotten.Tomatoes | 0.0064 | 0.0032 | 2.02 | 0.0440 |
| Age7+:Rotten.Tomatoes | 0.0081 | 0.0034 | 2.39 | 0.0168 |
| Ageall:Rotten.Tomatoes | -0.0124 | 0.0045 | -2.78 | 0.0055 |
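One way to read the interaction terms is as an adjustment to the Rotten.Tomatoes slope within each age group. The Python sketch below adds each interaction coefficient from the table above to the main Rotten.Tomatoes coefficient. It assumes that 16+ is the reference level of Age (the level that does not appear in the coefficient table), and the variable names are illustrative only.

```python
# Main effect of Rotten.Tomatoes in the interaction model (reference age group).
base_slope = 0.03930

# Interaction coefficients Age:Rotten.Tomatoes, taken from the output above.
# ASSUMPTION: "16+" is the reference level, so its adjustment is zero.
slope_adjustment = {
    "16+": 0.0,
    "18+": 0.00639,
    "7+": 0.00814,
    "all": -0.01237,
}

# Effective change in IMDb per one-point increase in Rotten.Tomatoes, by age group.
slopes = {age: base_slope + delta for age, delta in slope_adjustment.items()}

for age, slope in slopes.items():
    print(f"{age}: {slope:.5f}")
```

Under this reading, the relationship between the two ratings is steepest for 7+ shows and flattest for shows rated for all ages.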
6.1.2.6 Comparison of the three models
## Analysis of Variance Table
##
## Model 1: IMDb ~ Rotten.Tomatoes
## Model 2: IMDb ~ Year + Age + Rotten.Tomatoes + Hulu
## Model 3: IMDb ~ Year + Age + Rotten.Tomatoes + Hulu + Rotten.Tomatoes:Age
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3196 2787
## 2 3191 2744 5 43.2 10.1 0.0000000013 ***
## 3 3188 2724 3 20.2 7.9 0.0000300757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 3196 | 2787 | NA | NA | NA | NA |
| 3191 | 2744 | 5 | 43.2 | 10.1 | 0 |
| 3188 | 2724 | 3 | 20.2 | 7.9 | 0 |
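Adding the interaction between Rotten.Tomatoes and Age significantly improves the fit, so the full model is the best of the three, although its explanatory power is still low (adjusted R-squared of 0.281).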
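The F statistics in the analysis of variance table can be rechecked by hand. The Python sketch below divides each "Sum of Sq" per added degree of freedom by the residual mean square of the largest model, which is how R's `anova()` scales the comparison when several `lm` fits are passed together. The helper name `anova_f` is our own, and the inputs are the rounded values from the table, so only the first decimal should be trusted.

```python
def anova_f(sum_sq, df_added, rss_full, df_full):
    """F statistic for adding `df_added` terms, scaled by the full model's residual mean square."""
    return (sum_sq / df_added) / (rss_full / df_full)


# Residual sum of squares and residual df of the largest model (Model 3).
rss_full, df_full = 2724, 3188

# Model 1 -> Model 2: adds Year, Age (3 dummies), and Hulu = 5 parameters.
f_model2 = anova_f(sum_sq=43.2, df_added=5, rss_full=rss_full, df_full=df_full)

# Model 2 -> Model 3: adds the 3 Age:Rotten.Tomatoes interaction terms.
f_model3 = anova_f(sum_sq=20.2, df_added=3, rss_full=rss_full, df_full=df_full)

print(round(f_model2, 1), round(f_model3, 1))
```

Both values agree with the F column of the table (10.1 and 7.9) to the precision shown.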
6.2 Final results and approved model for prediction of IMDb rating
IMDb = 23.464467 - 0.009186 * Year + 0.046692 * Rotten.Tomatoes - 0.209429 * Hulu + (-0.516497 + 0.000526 * Rotten.Tomatoes) * Age7+ + (0.005813 - 0.007472 * Rotten.Tomatoes) * Age16+ + (-0.453837 - 0.001010 * Rotten.Tomatoes) * Age18+ + (0.609364 - 0.020359 * Rotten.Tomatoes) * Ageall
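To illustrate how the equation above is used, the following Python sketch evaluates it for a single show. The function name `predict_imdb` and the example inputs are hypothetical. The sketch assumes that the Age terms are 0/1 indicators, that Hulu is 1 when the show is available on Hulu, and that a show in none of the listed age groups (presumably 13+) is the reference level with no adjustment.

```python
# Coefficients copied from the final equation above.
INTERCEPT = 23.464467
B_YEAR = -0.009186
B_RT = 0.046692
B_HULU = -0.209429

# (intercept shift, Rotten.Tomatoes slope shift) for each age group.
# ASSUMPTION: any other age label is treated as the reference level (0, 0).
AGE_TERMS = {
    "7+": (-0.516497, 0.000526),
    "16+": (0.005813, -0.007472),
    "18+": (-0.453837, -0.001010),
    "all": (0.609364, -0.020359),
}


def predict_imdb(year, rotten_tomatoes, hulu, age):
    """Predicted IMDb rating from the final model equation."""
    shift, slope_shift = AGE_TERMS.get(age, (0.0, 0.0))
    return (
        INTERCEPT
        + B_YEAR * year
        + B_RT * rotten_tomatoes
        + B_HULU * hulu
        + shift
        + slope_shift * rotten_tomatoes
    )


# Hypothetical example: an 18+ show from 2017 with a Rotten Tomatoes score of 80, not on Hulu.
example = predict_imdb(year=2017, rotten_tomatoes=80, hulu=0, age="18+")
print(round(example, 2))
```

For this hypothetical show the predicted IMDb rating is about 8.14. The same show on Hulu would be predicted about 0.21 points lower.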
6.3 Problems and future research
6.3.1 Weak correlation
The weak correlation between IMDb and Rotten Tomatoes suggests that other variables, not yet identified, also influence the ratings. Finding these variables could be a focus of future research.
6.3.2 Left-skewed IMDb
The IMDb histogram above is slightly left-skewed, which indicates that the dataset contains a number of outliers with very low IMDb ratings. When building the model, we did not exclude these outliers because we considered them important.
7 Conclusion
Some of the world’s largest entertainment giants have ventured into streaming entertainment during the previous decade, including Netflix, Hulu, Prime Video, and Disney+. As more individuals are compelled to stay at home to prevent the spread of the new coronavirus, the idea of a bored, cable-cutting consumer looking for shows, documentaries, or series to watch for weeks on end has become a reality. As a result, we selected the TV shows found on Netflix, Hulu, Prime Video, and Disney+ as the topic of this project. We analyzed the rate at which TV shows have been released over time, the most popular streaming platform, and the targeted audience. The dataset comes from a Kaggle sample dataset, from which it was extracted in CSV format. The data consists of 5367 observations and 11 variables, including ID, Title, Year, Age, IMDb, Rotten Tomatoes, Netflix, Hulu, and Prime Video. The dataset contains both numerical and categorical data.
SMART QUESTIONS:
1. What are the most targeted age groups for the TV shows by Netflix, Hulu, Prime Video?
2. Which year published the highest number of TV shows?
3. Which streaming platform has the highest average rating (according to Rotten Tomatoes and IMDb)?
4. What is the relationship between IMDb and Rotten Tomatoes?
The independent variables we considered include the target age group and the streaming platform. Our dependent variable was the TV show rating, including both IMDb and Rotten Tomatoes. Our analysis provided insights into people’s preferences for TV shows on different platforms.
After data cleaning and EDA on most variables in the dataset, we found that 16+ is the most targeted age group for TV shows, followed by 18+, which means the main target audience is adolescents and young adults.
We also found that the target age group was highly dependent on the streaming platform. Disney TV shows catered to all ages, Netflix and Hulu focused on 18+, and Prime was more varied across multiple groups.
Looking at the years in which TV shows were released, more and more TV shows have been created in recent years, with 2017 as the peak. Very few of the listed shows were produced during the 20th century; the majority were created in the past 20 years.
As we proceeded with hypothesis testing of the different platforms and ratings, we found that IMDb has a higher average rating than Rotten Tomatoes and that the correlation between the two rating systems is positive but weak. The two also have different distributions overall: IMDb has a higher average and is left-skewed, while Rotten Tomatoes tends to be lower and has low-value outliers. In addition, IMDb and Rotten Tomatoes disagree about the highest-rated platforms. Using IMDb, Prime has both the highest mean and the highest median rating. According to Rotten Tomatoes, however, Prime has the lowest rating, Hulu has the highest median rating, and Netflix has the highest mean rating.
As we explored possible linear models relating the rating systems to platform, age group, and year of creation, we found a significant interaction between age group and Rotten Tomatoes in predicting IMDb. This relationship can be seen in our final results.